Saving Data, and Tools for Large or Concurrent Jobs

Five use cases are considered:

  • saving output in common formats for sharing (CSV, Excel)
  • saving output in binary formats for further analysis (pickle, HDF5, SQL)
  • processing a large video, saving results one frame at a time
  • processing many videos in parallel
  • accessing partially complete results during analysis

Saving data

In the simplest case, you can locate the features in every frame of a movie and collect the results in a single variable.


In [4]:
import mr
import pandas as pd  # used below for HDF5 storage and SQL queries

In [10]:
v = mr.Video('/home/dallan/mr/mr/tests/water/bulk-water.mov')

In [12]:
f = mr.batch(v[:3], 11, 3000)


mr.core.feature.locate:  693 local maxima, 259 of qualifying mass
mr.core.feature.batch:  Frame 0: 259 features
mr.core.feature.locate:  695 local maxima, 203 of qualifying mass
mr.core.feature.batch:  Frame 1: 203 features
mr.core.feature.locate:  702 local maxima, 184 of qualifying mass
mr.core.feature.batch:  Frame 2: 184 features
mr.core.feature.locate:  706 local maxima, 220 of qualifying mass
mr.core.feature.batch:  Frame 3: 220 features

The result is a DataFrame, which can be saved in formats convenient for sharing, like Excel


In [13]:
f.to_excel('features.xlsx')

or comma-separated values


In [14]:
f.to_csv('features.csv')

These formats are slow to read and write. Unless you are sending the file to a non-programmer, it is better to save it in a binary format.


In [37]:
f.save('features.df') # df for DataFrame -- could be any name you want
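
To read the file back in a later session, use the matching pandas function. (A minimal sketch: DataFrame.save writes a pickle, which older versions of pandas load with pd.load; newer versions use to_pickle and read_pickle instead.)

f = pd.load('features.df')  # in newer pandas: pd.read_pickle('features.df')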

Saving large jobs while they run

For large jobs, it is better to save the features one frame at a time as the job proceeds. If the job is interrupted, partial progress will be saved. And the job requires only enough memory to process one frame at a time -- it need not hold all the frames' data.

batch can do this in two different ways: using an HDF5 file (a fast binary format) or a SQL database.

HDF5

For HDF5, we open an HDF5 file using pandas, and pass it to batch.


In [20]:
store = pd.HDFStore('data.h5')
f = mr.batch(v[:3], 11, 3000, store=store, table='bulk_water/features')
# table can take any unique name -- even slashes and spaces are OK


mr.core.feature.locate:  693 local maxima, 259 of qualifying mass
mr.core.feature.batch:  Frame 0: 259 features
mr.core.feature.locate:  695 local maxima, 203 of qualifying mass
mr.core.feature.batch:  Frame 1: 203 features
mr.core.feature.locate:  702 local maxima, 184 of qualifying mass
mr.core.feature.batch:  Frame 2: 184 features
mr.core.feature.locate:  706 local maxima, 220 of qualifying mass
mr.core.feature.batch:  Frame 3: 220 features

batch saves the data one frame at a time, discarding each frame's data before it begins the next one. In this way, memory is conserved and long videos can be processed. At the end, batch loads the data out of the HDF5 file and returns it in the variable f.
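
The same frame-by-frame pattern is easy to emulate for your own per-frame computations. A minimal sketch (illustrative only, not mr's actual implementation; locate_one_frame is a hypothetical stand-in for whatever per-frame analysis you run):

import pandas as pd

store = pd.HDFStore('my_data.h5')
for i, image in enumerate(frames):      # frames: any iterable of images
    features = locate_one_frame(image)  # hypothetical per-frame analysis
    features['frame'] = i               # tag each row with its frame number
    store.append('my_table', features)  # append rows, keeping earlier ones
result = store['my_table']              # load the full table at the end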

If you wish to run jobs simultaneously in several Python sessions, you might want to leave the data in the store and retrieve it later, in part or in full. Use do_not_return=True.


In [22]:
mr.batch(v[:3], 11, 3000, store=store, table='bulk_water/features', do_not_return=True)
# This returns nothing.


mr.core.feature.locate:  693 local maxima, 259 of qualifying mass
mr.core.feature.batch:  Frame 0: 259 features
mr.core.feature.locate:  695 local maxima, 203 of qualifying mass
mr.core.feature.batch:  Frame 1: 203 features
mr.core.feature.locate:  702 local maxima, 184 of qualifying mass
mr.core.feature.batch:  Frame 2: 184 features
mr.core.feature.locate:  706 local maxima, 220 of qualifying mass
mr.core.feature.batch:  Frame 3: 220 features

We can load it from the store later.


In [25]:
f = store['bulk_water/features']
f.head()


Out[25]:
x y mass size ecc signal ep frame
2 36.048647 8.120968 3844 2.771191 0.146392 22.279131 0.370538 0
3 67.232830 7.869478 3509 2.570799 0.053752 23.279131 0.412061 0
5 430.957784 7.319437 5685 2.763565 0.288109 26.279131 0.315874 0
6 629.180087 8.195757 4148 3.248655 0.216354 14.279131 0.420683 0
12 552.773313 11.108589 3260 2.211168 0.118502 29.279131 0.442856 0

If it is too large, we can fetch it in part:


In [43]:
f = store.select('bulk_water/features', pd.Term('frame < 3'))
f.head()


Out[43]:
x y mass size ecc signal ep frame
2 36.048647 8.120968 3844 2.771191 0.146392 22.279131 0.370538 0
3 67.232830 7.869478 3509 2.570799 0.053752 23.279131 0.412061 0
5 430.957784 7.319437 5685 2.763565 0.288109 26.279131 0.315874 0
6 629.180087 8.195757 4148 3.248655 0.216354 14.279131 0.420683 0
12 552.773313 11.108589 3260 2.211168 0.118502 29.279131 0.442856 0

SQL

As an alternative to HDF5, we can use a SQL database. The simplest choice is sqlite, which stores an entire database in a single file.


In [33]:
import sqlite3
conn = sqlite3.connect('data.sql')
f = mr.batch(v[:3], 11, 3000, conn=conn, sql_flavor='sqlite', table='bulk_water/features')


mr.core.feature.locate:  693 local maxima, 259 of qualifying mass
mr.core.feature.batch:  Frame 0: 259 features
mr.core.feature.locate:  695 local maxima, 203 of qualifying mass
mr.core.feature.batch:  Frame 1: 203 features
mr.core.feature.locate:  702 local maxima, 184 of qualifying mass
mr.core.feature.batch:  Frame 2: 184 features
mr.core.feature.locate:  706 local maxima, 220 of qualifying mass
mr.core.feature.batch:  Frame 3: 220 features

A MySQL database is also supported. The mr.sql module provides a convenience function for making a MySQL database connection.


In [32]:
f = mr.batch(v[:3], 11, 3000, conn=mr.sql.connect(), sql_flavor='mysql', table='bulk_water/features')


mr.core.feature.locate:  693 local maxima, 259 of qualifying mass
mr.core.feature.batch:  Frame 0: 259 features
mr.core.feature.locate:  695 local maxima, 203 of qualifying mass
mr.core.feature.batch:  Frame 1: 203 features
mr.core.feature.locate:  702 local maxima, 184 of qualifying mass
mr.core.feature.batch:  Frame 2: 184 features
mr.core.feature.locate:  706 local maxima, 220 of qualifying mass
mr.core.feature.batch:  Frame 3: 220 features

As with HDF5, you can conserve memory using do_not_return=True.
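
For example (a sketch that simply combines the parameters demonstrated above):

# Save to the sqlite database and return nothing.
mr.batch(v[:3], 11, 3000, conn=conn, sql_flavor='sqlite',
         table='bulk_water/features', do_not_return=True)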

Accessing partial data sets without interrupting analysis

Finally, it is sometimes convenient to examine early results while the full video is still being processed. This is not possible with an HDF5 file, which does not support concurrent reading and writing, but SQL makes it possible. (Note that the slash in the table name passed to batch evidently becomes an underscore in the SQL table, bulk_water_features.)


In [36]:
partial = pd.io.sql.read_frame('select * from bulk_water_features', conn)
partial.head()


Out[36]:
x y mass size ecc signal ep frame
0 36.048647 8.120968 3844 2.771191 0.146392 22.279131 0.370538 0
1 67.232830 7.869478 3509 2.570799 0.053752 23.279131 0.412061 0
2 430.957784 7.319437 5685 2.763565 0.288109 26.279131 0.315874 0
3 629.180087 8.195757 4148 3.248655 0.216354 14.279131 0.420683 0
4 552.773313 11.108589 3260 2.211168 0.118502 29.279131 0.442856 0

Here we see the full result because this short example job has already finished.
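
While a long job is still running in another session, the same query pattern can be used to check progress. A minimal sketch, assuming the sqlite file and table names used above:

import sqlite3
import pandas as pd

conn = sqlite3.connect('data.sql')
# Count how many frames have been written to the table so far.
progress = pd.io.sql.read_frame(
    'select count(distinct frame) from bulk_water_features', conn)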